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Abstract 

Most classification methods provide either a prediction of class member- 
ship or an assessment of class membership probability. In the case of two- 
group classification the predicted probability can be described as "risk" of 
belonging to a "special" class . When the required output is a set of ordinal- 
risk groups, a discretization of the continuous risk prediction is achieved by 
two common methods: by constructing a set of models that describe the con- 
ditional risk function at specific points (quantile regression) or by dividing 
the output of an "optimal" classification model into adjacent intervals that 
correspond to the desired risk groups. By defining a new error measure for 
the distribution of risk onto intervals we are able to identify lower bounds on 
the accuracy of these methods, showing sub-optimality both in their distribu- 
tion of risk and in the efficiency of their resulting partition into intervals. By 
adding a new form of constraint to the existing maximum likelihood optimiza- 
tion framework and by introducing a penalty function to avoid degenerate so- 
lutions, we show how existing methods can be augmented to solve the ordinal 
risk-group classification problem. We implement our method for generalized 
linear models (GLM) and show a numeric example using Gaussian logistic 
regression as a reference. 

1 Introduction 

The classical problem of discriminating between two classes of observations based 
on a given dataset has been widely discussed in the statistical literature. When 
only two classes are involved, the question of discrimination is reduced to whether 
or not a given observation is a member of a "special" class (where the other class is 
the default state, for example sick vs. healthy). Some classification methods, such 
as Fisher's linear discriminant analysis (LDA), make a decisive prediction of class 
membership while minimizing error in some sense, typically the misclassification 
rate. Other methods, such as logistic regression, provide an estimate of the exact 
conditional probability of belonging to the "special" class given a set of predictor 
variables. Throughout this paper we shall refer to this conditional probability as 
"conditional risk" or simply "risk", although sometimes belonging to the special 
class might actually have a very positive context (e.g. success). 



There are two ways to estimate the conditional risk function: parametric and non- 
parametric. Parametric methods primarily include logit/probit models (Martin 1977 
P] . Ohlsen 1980 [21]) and linear models (Amemiya 1981 [I], Maddala 1986 [HI and 
Amemiya 1985 [2]). Powell (1994 [22]) has a review of non-parametric estimators. 
For a comparison of these approaches and complete review see Elliott and Lieli 
(2006) [7] and more recently Green and Hensher (2010) [lOj. 

The estimation of the exact structure of the conditional risk function comes in handy 
when we wish to make distinctions between observations that are finer than simply 
class membership. However, in realistic scenarios acting upon such estimations 
alone may prove to be difficult. Assessments on a scale of 1:100 (as percentages) 
or finer assessments have little practical use, primarily since the possible actions 
resulting from such information are usually few. For such cases an ordinal output is 
required. It is important to note that this problem is not equivalent to multi-group 
classification in two ways: first, our groups are ordinal by nature and relate to the 
underlying risk; second, the assignment into groups is not given a-priori and greatly 
depends on the selection of model, model parameters and the borders of the intervals 
assigned to each risk group. 

There are two common approaches to creating an ordered set of risk groups to 
match a finite set of escalating actions. The first approach is to create multiple 
models describing the behaviour of the conditional risk function at specific points 
(also known as "quantile regression"); the second approach is to divide post-hoc the 
continuous risk estimation of a known model into intervals. 

The first approach attempts to construct separate models that describe the be- 
haviour of the conditional risk function at specific levels of risk. In linear models 
this approach is known as quantile regression (Koenker & Bassett 1978 [13] )• Manski 
([T5]. [TB] . [T7] ) implemented this notion to binary response models (the equivalent 
of two-group classification) naming it "Maximum Score Estimation". In a series of 
papers he shows the existence, uniqueness and optimal properties of the estimators 
and follows by showing their stable asymptotic properties. The primary justification 
for using this approach is methodological: it demands that we specify in advance the 
levels of risk that are of interest to us (a vector q of quantiles), and then constructs 
a series of models that describe conditional risk at these quantiles. However, as 
we shall demonstrate in section 13. H using risk-quantiles (or conditional probability 
over left-unbounded and overlapping intervals) is not relevant to our definition of 
the problem and even the term "conditional quantiles" is in itself misleading. 

In the second, more "practical" approach, the continuous output of an existing 
optimal risk model (logit, linear or non-parametric) is divided into intervals, thus 
translating the prediction of risk (usually continuous in nature) into large "bins of 
risk" - i.e "low" / "medium" / "high" or "mild" / "moderate" / "severe" (depending on 
context). The final result of this discretization process is a set of ordinal risk groups 
based on the continuous prediction of conditional risk. The primary drawback of 
this approach is that the selection of the classification model and its parameters is 
not performed in light of the final set of desired risk groups. Instead, an "optimal 
model" (in some sense) is constructed first, and the partition into discrete groups is 
performed post-hoc. 
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The primary objective of this paper is to combine the idea of pre-set levels of risk over 
adjacent intervals (rather than risk quantiles) into a standard classification frame- 
work. Instead of constructing multiple models, we offer a process that optimizes a 
single risk estimation model (or "score") paired with a matching set of breakpoints 
that partition the model's output into ordinal risk groups. To that end we define a 
new measure of accuracy - Interval Risk Deviation (IRD) - which describes a model's 
ability to distribute risk correctly into intervals given a pre-set vector r of risk levels. 
We show how this new measure of error can be integrated into existing classification 
frameworks (specifically the maximum likelihood framework) by adding a constraint 
to the existing optimization problem. In addition, we address the more practical 
problem of effectively selecting breakpoints by introducing a penalty function to the 
modified optimization scheme. 

The remainder of this paper is organized as follows. Section |2] defines risk groups 
and a measure of error (IRD) that will be necessary for optimality. Section [3] demon- 
strates the problems of using existing approaches. Section H] formulates a new opti- 
mization problem that will provide accurate, optimal and non-degenerate solutions, 
and section [5] provides a case study where the new framework is applied to logistic 
regression and presents an example. 

2 Definitions 

Let r G [0, 1]^ be an ordered vector of risk levels (0 < ri < r2 < . . . < < 1), let 
X = {Xi, . . . , Xp) be a continuous P-dimensional random vector and let Y G {0, 1} 
be a Bernoulli random variable representing class membership. An Ordinal Risk- 
Group Score (ORGS) for a pre-set risk vector r is a couplet (\E', r) where \I' : M.^ — )■ 
M is a continuous (possibly not normalized) risk predictor, which summarizes the 
attributes of X into a single number (a score), and r G M'^"^ is a complete partition 
of M into T distinct and adjacent intervals (— oo = tq < ti < T2 < . . . < tt-i < 
tt = 00). The couplet (\E', r) classifies observations into risk groups by the following 
equivalence: An observed vector X belongs to the i'th risk group if and only if 
\E'(X) G (Ti_i, Ti] (the intervals are right-side open to avoid ambiguities). The actual 
conditional risk level of the i'th risk group defined by a couplet (\E', r) is: 

R,{^, t)=P{Y = 1\ ^{X) G (r,_i, r,]) (1) 

It is worth noting that score-based classification methods for two classes can be 
described as a special of T = 2 (two risk groups). Such methods look for a single 
breakpoint r G M, and the two resulting intervals (— 00, r], (r, 00) become an ab- 
solute prediction of class membership: "^{X) > t ^ X belongs to class 1. Other 
methods, designed to deal with more than one risk group, typically assign a single 
breakpoint to each risk group (see section [3TT1) . reflecting the idea that the assign- 
ment to risk group is based on thresholds: an observed X is assigned to the «'th 
group if and only if ^^(X) crosses the {i — l)'th threshold {'^{X) > rj_i) but does 
not cross the i'th threshold (^'(X) < Ti). 

Even from the latter definition, it becomes evident that the assignment to groups is in 
fact based on adjacent intervals {(Tj_i, Ti]}J~^ (rather than on right-side open ended 
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intervals defined by thresholds) ans that any breakpoint we set affects the definition 
of two intervals (and hence two risk groups). Although further on in this paper we 
shall discuss separate breakpoints in relation to risk groups in order to demonstrate 
the key problem that arises from the use of adjacent intervals (section[3]2]), the notion 
of assigning intervals simultaneously rather than separate breakpoints should remain 
clear throughout this paper. 

We can now describe the accuracy of an ordinal risk score (\&, r) in relation to a 
pre-set vector r as the overall difference between the pre-defined risk levels of r 
and the actual conditional risk levels -R(\l', r). We define an error measure for risk- 
group classification models which is a parallel of misclassification rate in standard 
classification methods. We name this measure Interval Risk Deviation (IRD): 



On it's own, the very definition of IRD marks a new approach to the evaluation 
of ordinal risk scores. Having a predefined set of risk levels means that any risk 
score (^E',t) we consider as a candidate must uphold IRD,,(\E',r) = (or at the 
very least IRDj,(^,r) < e for a predefined small e > 0). This makes IRD = a 
necessary condition for optimality. In the next two sections we demonstrate how 
the two existing approaches for creating ordinal risk scores do not necessarily fulfil 
this condition, either because of unsuitable definitions of optimality, as is the case 
with risk-quantile based methods, or by ignoring it altogether, as is the case with 
the 2-step approach. 

3 Problems with Existing Scoring Methods 

3.1 Risk-Quantiles (and why we can't use them) 

When first presented with the problem of selecting an optimal model paired with a 
set of optimal breakpoints, our initial idea was to use quantile-oriented models. Such 
models have been extensively studied in econometrics, where they are commonly 
referred to as "ordered choice models" ([2Z], [ID])- The most relevant model in 
that group is Manski's maximum score estimation which defines the optimization 
problem using a set of probabilities over left-unbounded overlapping intervals (or 
rays) in contrast to the definition of the problem over adjacent, non- overlapping 
intervals. 

In order to better illustrate the differences between our definitions and Manski's 
quantile-oriented approach we must first describe quantile oriented models in our 
terms. First we replace the vector r with a vector q of " conditional quantiles" , which 
are in fact the desired conditional probabilities over left-unbounded and overlapping 
intervals. Using Manski's adaptation of quantile regression [15j we can build a 
different set of model parameters for each quantile optimizing: 




(2) 




(3) 
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It is easy to see how this approach can be shghtly modified to match the original 
objective of finding a single model: by coercing the models \E'j to be parallel we can 
create a "master model" ^'(X) and derive appropriate thresholds {Ti}J~i such that: 

^^{X) < ^ ^(X) < r, 

\P{Y = 1 I ^(X) < r,) - q,\ mm z G {1, . . . T} (4) 

Using (jH) we can easily define Qi(\E',r) = P{Y = 1 | \E'(X) < Tj) and the equiva- 
lent Quantile Risk Deviation QRDqi^ ,t) = \\Q{'^,t) — q\\, and look for a model 
with QRD = 0. However, while it is tempting to describe the vector g as a vector 
of ''conditional quantiles" , the term is in itself misleading and should be avoided. 
Figure [1] demonstrates how even under relatively simple assumptions (a one di- 
mensional Gaussian distribution with unequal conditional variances) the function 
Qii"^, t) = P{Y = 1 I X < Tj) is not even monotone in r^. 



Oi>Oo 01=00 Oo>Oi 




Figure 1: Different behaviour of conditional probability over left-unbounded in- 
tervals as a function of the threshold x in the case of one-dimensional Gaussian 
distribution with /iq = —1, /zi = 1 and P{Y = 1) = 0.2. In the left panel 
(Ti = 4, (To = 1, in the middle panel Ui = ctq = 2 (homoscedastic case) and in 
the right panel cti = 1, (Tq = 4. 

Even if we assume strict monotonicity of P{Y = 1 | ^E'(X) < x), for example 
by assuming the strict monotone likelihood ratio property (SMLRP, for details see 
Appendix and thus giving the term "conditional quantiles" a meaningful sense, 
it would still be impossible to apply this approach to optimizing the distribution of 
risk over adjacent intervals. In order to use "risk-quantiles" to solve our problem we 
must first find an a-priori mechanism that will translate any given vector of desired 
conditional probabilities over adjacent intervals r to the equivalent vector of desired 
conditional probabilities over left unbounded and overlapping intervals q. 

However it is easy to show that such an a-priori translation is impossible. Using the 
law of total probability in its conditional form we can calculate for any given R the 
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equivalent Q*^'^-* (actual probabilities over left unbounded intervals): 



Qf\^,r) = PiY = l\^{X)<T,) 



V P(F = 1 I mx) e (r- , ,n mA-)£fc-..r,|.j.(.Y)<r,) 
1 I a-W t lr,_i,T,|) i5(<f (A') < r,) 



(5) 



Or equivalently: 



The same process can be applied to the corresponding vector of risk quantiles q^'^^: 



P{^{X) < n) 



(7) 



' P{^{X)e{n^i,T,]y' P(vI/(X)G(r,_i,r,]) 
As a result for a fixed r) we have: 

IRD^(^,r) = ^ gPP>gM(^,r) = (8) 

The primary problem of using quantiles to define this problem stems from the re- 
lation between r and the resulting g^- By our own definitions the central aspect 
of the problem is the probability over adjacent intervals and not overlapping left- 
unbounded intervals. Therefore the optimization must be performed against a fixed, 
pre-defined vector r. If we wish to construct an analogous quantile-based optimiza- 
tion problem, we must first find the equivalent vector which defines quantile-based 
problem. However equation ([7j) shows that since the relation between r and qr de- 
pends on the specific form of the optimal model \Ef, in order to construct qr we must 
first find the optimal model \l/ for this problem (which is what we are looking for 
in the first place), or in other words the translation r -H- g is possible only once we 
have the optimal solution to the problem. Therefore building an analogous opti- 
mization problem over left-unbounded overlapping intervals can only be done after 
we have the optimal solution. Consequently we cannot use quantile-based models 
to construct an optimal model for the adjacent interval-based ordinal risk-group 
problem. 
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3.2 Lower bounds on Interval Risk Deviation 



Another common practice when building scores for risk groups is to build a model 
^ that is optimal in some sense (e.g. maximizing likelihood or minimizing overall 
miss-classification rate) and then partition the range of "^{X) into adjacent intervals 
the define risk groups. In this section we demonstrate how, under relatively simple 
assumptions, using this approach with existing classification models is not optimal 
for more than two risk groups. 

Using Bayes theorem we can represent R as: 

We assume that {X, Y, satisfies the Strict Monotone Likelihood Ratio Property 
(SMLRP, see appendix |A] for exact definition and details) and that the marginal 
densities fx\Y=k ik = 0, 1) are continuous, strictly positive and finite. By continuity 
and finiteness we can describe the behaviour of -Ri(\E', t) for infinitely short intervals 
{n rj_i): 

hm Ri^r)= hm = I) PmX) e jr^.^r.] \ Y = 1) ^ 

n^r.., ' ^ n^n., P(^(X) G (r,_i, r,]) 

p(M>(x)g(r„n_i]|y=i) ^ ^ (9) 

= p(v = n "'^""^ "'-"'-1 = p(v ^, J'^m\y=n^i- ^^ 



n—Ti-i 



where / is the appropriate density function and the limit is from the right-hand 
side. Similarly for any z G (Tj_i,rj], 

lim lim Ri{^,T)= lim lim /?,(^,r) = P(Y = i) ■^^Wl^=i^^) (iq) 

Although we have stressed the importance of simultaneity when assigning intervals 
to risk groups, in order to understand the implications of fllU]) on optimal model 
selection we must look at the problem from a different perspective. First we fix 
and assume that a given partition r supports a perfect distribution of conditional 
risk up to the {i — l)'th group, meaning that Rji"^, t) = Vj for all j < i. Under these 
conditions, combined with our previous assumptions of continuous, strictly positive 
conditional densities and SMLRP, we can explicitly show that not all values of 
are exactly achievable without introducing some IRD: by theorem lA.ll RA"^. r) is 
strictly increasing in Tj and therefore we can explicitly define a feasibility criterion: 

P(F = 1) ^T'7^V <^- (11) 

Using continuity (which enables us to divide by P(Y = l)/*(x)|y=i(Ti-i)) we can 
transform (|TT|) into a condition on the likelihood ratio A: 

. . . ^(x)|y=i(T:.-i) ^ 1-P(y = l) r, 

Avi,(ri„i) = < -— (12) 

hix)\Y=o{ri-i) P(y = 1) l-n 
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If r does not meet the feasibility criterion f lTT|) . then by (|9]) and strict monotonicity 
of R any selection of Tj > rj_i will have i?j(\E',r) > even if we set the interval 
(rj_i,rj] to be arbitrarily small. The inevitable result that, for the our fixed model 
\E', any choice of r will have IRDr(\I', r) > 0. 

It is important to note that the set of T — 1 inequalities defined by ffTTj) . f|T2|) are 
necessary yet not sufficient conditions for IRD=0. Assume that we have a solution 
t) which satisfies IRD^(\E', r) = 0. Under SMLRP we have X2> Xi ^ A<^ (0:2) > 
A^(xi). Our counter example {^,t) satisfies fi < Ti and Vz > 1 : fi = Ti. By 
SMLRP we have: 



-1 



Therefore f|TT]) is maintained (the other inequalities are not affected). On the other 
hand by theorem lA.ll we have strict monotonicity of R, meaning: 



f) = P{Y = 1 I ^(X) < fi) < P{Y = 1 I ^{X) < n) = r) = 



ri 



and therefore IRDj.(\&,r) > 0. The conclusion is that even under SMLRP we can 
use f[TT]) . f[T^ only as necessary conditions for the feasibility of a given solution and 
that the test of feasibility must be performed using ([T]) and ([2]) directly. 

In order to satisfy the necessary conditions for IRD=0 in the absence of SMLRP we 
can generally require > inf Rii"^, t) (we require strong inequalities to avoid 

{Ti:Ti>Ti^l} 

degenerate zero- length intervals), however for such cases the existence of a closed- 
form expression would depend on the exact distribution of X\Y = k [k = 0, 1). We 
leave the exact formulation of non-SMLRP lower bounds outside the scope of this 
paper. 

The final conclusion is that given two sets of risk categories ri,r2 and a couplet 
(^,ri) which satisfies IRD^ (^E', ti) = 0, we may not be able to find a set of break- 
points T2 which satisfies IRDr2(^l/, T2) = (using the same model \E'). Specifically 
we can now claim that optimal models of existing classification methods (typically 
optimized for r = (0, 1)) are not necessarily feasible for any choice of r. 

The existence of lower bounds on the IRD is perhaps the most counter-intuitive 
result of this paper. The reason why these limitations have not been addressed before 
has to do with the fact that most classification methods use a single breakpoint to 
distinguish between the two groups (r G M) and the issue of degenerate solutions or 
non- feasibility of is avoided altogether. Although the fulfilment of f[TT]) . f[T^ does 
not ensure the feasibility of a given solution, these inequalities are instrumental in 
demonstrating why the solutions from existing methods may not be feasible for a 
different choice of r, and provide an elegant method to disqualify such solutions. 
Once we define our objective as the distribution pre-set risk levels over multiple 
adjacent intervals we must recognize the existence of possible limitations on IRD for 
existing methods and as a result define new conditions for optimality. 



8 



4 Ordinal Risk-Group Classification 



Although the definition of IRD naturally suggests itself as a new criterion for op- 
timality (look for a couplet (\I/,r) such that IRD,.(\1/, r(\l/)) = 0), there are two 
problems with using IRD as a single optimality criterion. First, since our problem 
is a classification problem we must consider some sense of the quality of separation 
between the two classes in order to avoid degenerate solutions. This principle is 
not straight forwardly refiected by the definition of IRD ([2]). Second, our defini- 
tion of IRD and the resulting necessary inequalities ffTTj) do not ensure existence or 
uniqueness of an optimal solution. 

Our practical solution to these problems is to define IRD as a feasibility criterion 
and use it as a constraint in an existing optimization problem. Since we are still in 
the domain of classification problems it would be reasonable to preserve some basic 
concepts, particularly the definition of optimality: We seek a model that on the 
one hand maximizes our ability to discriminate between the two classes, but on the 
other hand distributes risk correctly, meaning that it belongs to the set of feasible 
solutions: 

a(0) = {(^,r): IRD,(^,r) = 0} (13) 

In the event that C is an empty set we would have to reconsider our pre-set r or 
change our method of constructing 

Any classification method we might consider for IRD "augmentation" must satisfy 
several criteria. First, it must provide a continuous output \E'(X), ensuring that 
we have an appropriate output that can be partitioned into intervals (using r). 
This requirement automatically excludes classification methods that do not combine 
the vector of explanatory variables X into a single real- valued score '^{X) before 
making a prediction of risk or class membership (classification trees are an example 
of such excluded methods). Furthermore, we would like to maintain the notion 
that observations with higher scores have a higher conditional risk, and therefore 
require that the output '^{X) is strongly correlated with the conditional risk function 
P{Y = 1 I ^(X) = x). Methods such as Fisher's LDA [S] or SVM for two classes 
do provide a continuous scale and a single breakpoint to predict class membership, 
however these scales are not necessarily correlated with the conditional risk and only 
ensure that a majority of the observations from the special class are on one side of 
the breakpoint. We therefore decided to focus our discussion on risk estimation 
methods that provide a direct estimation of the risk function: 

^ : — > [0, 1] , m{X) = P{Y = 1 I X) (14) 

Finally, in order to simplify our construction we assume SMLRP (see appendix Rl) . 
As we have seen before, this assumption ensures that we have a simple way to 
calculate the lower bounds on IRD, and also ensures that for a given model if 
exists t(\I/) such that IRD,,(\1/, r(\l/)) = then it is unique (see lemma IKT]) . These 
properties enable us to simplify our parameter space by optimizing over \I' alone, and 
provide a simple way to test for the existence of necessary conditions for IRDr(\l/) = 
IRDr(^', r(^')) = and optimize under the constraint Cri"^) = {"^ : TRDr{^) = 0}. 
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4.1 Penalized Optimization 



While the idea of fitting an optimal risk predictor that maximizes class discrim- 
ination is a well defined concept, the requirement of IRDj.(^) = may lead to 
degenerate solutions of t(^) for certain values of r. We demonstrate this prob- 
lem for a simple case of homoscedastic one-dimensional Gaussian logistic regression: 
Let X I F = 1 - iV(/x,a), X \ Y = Q ^ iV(-/^,(T), P{Y = 1) = 1 and the 
model \E'(/3, x) is the one-dimensional logistic function with the parameter mean- 
ing * : R X R ^ [0, 1], *(^, x) = Y^g|y • We set r = (0, 0.5, 1). 

Denoting r(^',/3,t) = (^^(/S, — t), \E'(/3, t)), we use symmetry of the conditional dis- 
tributions around x — and the strict monotonicity of x) in x and (3 to show 
that for any choice of /3, t e R we can minimize error for i — 2: 

R.i^iP, X), t(*, /3, t)) = P(r = 1 I X) e -t), t)]) 
= P{Y = 1\PX e {-/3t, /3t]) = P(Y = 1\ X e (-t, t]) = 0.5 = rs 

Similar considerations ensure that the risk prediction errors are equal on both sides: 

R^{^{P, X),t{^, P, t))^P{Y^l\X< -t) 

= 1 - P(y = 1 I X > -i) = 1 - Rsi^^iP, X),t{^, (3, t)) 

Using Bayes theorem we have: 

P{Y = 1)P(X <-t\Y = 1) 



P(Y ^1\X <-t) 



P{Y = 1)P{X <-t\Y = 1) + P{Y = 0)P(X <-t\Y = 0) 

p(y ^ i)^{-t-i^) 

~P(Y = - //) + P{Y = 0)$(-t + fi) 

_ f p(y = o) ^-t + i,y 

where $ is the CDF of the standard normal distribution. Using the known inequality: 

<$(_^)<M vx>0, (15) 



X + 1/x X 
where (p is the PDF of the standard normal distribution, we show an upper bound: 

Therefore limt_j,oo PiY = 1 | X < —t) = and similarly lim^^oo P(y = 1 | X > 
t) — 1. For any arbitrarily small £ > we can find a sufficiently large t such that 

Pi(*(/3, X), t(^, /3, t)) = P(F = 1 I X < -t) =< e/2 

making the total IRD: 



IRD,(vl/(/3,X),r(vl/,/3,t)) 



\ i=i 
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As a result, for any given /3 the only solution that satisfies IRD=0 is degenerate: 

lim IRD^(^(/3, X), r(^, (3, t)) = 

t—^oo 

There are several alternatives for dealing with this problem. First, we may decide 
that methodologically we do not allow setting ri = or rr = 1- This will ensure that 
the values of r are finite but might still lead to very large or very small intervals, 
depending on the parameters of the model. Alternatively, if our risk estimation 
method uses optimization to fit the optimal model (for example maximizing the 
likelihood function in the case of parametric methods) then we can introduce a 
penalty function Pen : R^~^ — )■ R, which will enable us to balance the properties 
of r (minimal or maximal distance between breakpoints) with the discrimination 
properties of \E'. This means that instead of maximizing or minimizing a target 
function /(\E' | X,Y) we maximize/minimize /(\E' | X, F ) + 7Pen(r) under an IRD 
constraint, where 7 is a tuning parameter that represents the degree of aversion to 
degenerate solutions. 

In cases where the degenerate solutions are encountered we would opt for the use of 
a penalty function. This refiects our understanding that the requirement of "evenly 
spread" breakpoints is relatively subjective and should allow for some discretion as 
to the balance between the ability of the model to separate classes and the result- 
ing interval lengths. By choosing an appropriate penalty function and an aversion 
parameter 7 we enable better fitting of the model according to the circumstances 
at hand, while introducing a relatively small number of additional parameters. On 
the other hand, since IRD represents an absolute measure of the model's quality, 
we believe it must be tightly controlled as the constraint IRDr(\E',r) = on any 
model we might consider. We address the details of constructing this constraint for 
parametric models in the following section. 

4.2 Estimation of Interval Risk Deviation 

So far, we have defined interval risk deviation (IRD) as a property of a score model 
\l/ and the joint distribution of {X,Y). In order to implement the concept of IRD 
in a real-life scenario we must describe a way to estimate IRDj.(\Ef) based on a 
sample of N i.i.d observations from a known P-dimensional multivariate distribution 
J^{0) in the form of a iV x P matrix X and a vector y G {0, 1}^ representing 
known class memberships (depending on the design of the experiment y may or 
may not be a random sample). Focusing on parametric methods, we assume that 
P{Y = 1 I X) = ^(/3,X) where ^ : M*^ x ^ [0,1] is a known function 
and P G M^^ is the set of parameters controlling the shape of the function (e.g. 
generalized linear models [20] where M = P, g is a known, strictly monotone and 
bijective link function and \E'(/3,a;) = g{(3^x)). Having previously assumed SMLRP 
and a closed-from we can simplify our notation by denoting r(/3) = r(^'(/9,X)), 
IRD for a given (3 as IRD^(/3) = IRD^(^(/3, X), r(^(/3, X))) and the constraint set 
Cri(3) = {(3e M^^ : IRD,(/3) = 0} 

Many parametric classification methods solve the problem of estimating /3 from a 
given sample (X, y) by using the maximum likelihood (ML) method. We denote 
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Xj^. the j'th row of the matrix X, making our model-predicted probabihty for the 
j'th observation \l/(/3,Xj .) = P{Y = 1 | Xj .). Assuming random samphng, there 
are two equivalent formulations of the "complete" likelihood function: 

N N 

m,e„p I X,y) = n/x,y(X„.,%) = n/%(X„.)n>S- = %) (16) 

N 

L{f3,e I X,y) = n^(^ = yj I X,,.)/e(X„.) (17) 

i=i 

where fg is the density function of the distribution J^{6) of X and fg^ is the density 
function of the conditional distribution J^y^iOy.) oi X \ Y = Uj. When using f|T6l) we 
must make additional assumptions about the conditional distribution oi X \ Y = k 
and a random sampling process, and as a result our estimator /3 becomes a function 
of the estimators 6o,0i,p. On the other hand using (|T7|) can significantly simplify 
the optimization process. Since our parameter of interest /3 is isolated in the term 
P{Y = 1 I Xj;.) = \['(/3,Xj^.) and does not effect the term fg(X.j^.), we can directly 
maximize the the partial likelihood function Lq, over the values of f3: 

N N 

L^iP \X,y) = l[P{Y = \ X,,) = J] vl/(/3, X,,)^^ (1 - vl/(/3, X,-,))^"^^' (18) 

making the corresponding maximum likelihood optimization problem: 

Pml = argmax Lq,{(3 \ X, y) (19) 



One of the primary advantages of using the second approach for maximum likelihood 
estimation is that it circumvents the need to estimate the parameters of the condi- 
tional distributions, thus making /3 the only estimated parameter. This construction 
also enables a relatively simple extension of the maximum likelihood framework to 
semi-parametric models or non-random sampling (see [23] for such an extension of 
logistic regression). 

Incorporating the IRD constraint into the maximum likelihood framework means 
that for any given /3 we must be able to estimate the conditional probability over 
intervals Ri{f3,T) = P{Y = 1 \ \Ef(/3,X) G (ri_i,rj]) for an arbitrary r from the 
sample (X, y), which we can then use to calculate the estimate for the unique optimal 
r(/3) (which is a function of both /3 and the distribution of X). Alas, it is usually 
difficult to derive Ri{/3, r) directly from the point- wise conditional probability P{Y = 
1 I X). In order to facilitate the estimation of IRD for such cases we use an approach 
similar to f|T6l) but in a different context. As we shall see this will enable us to provide 
parametric estimates of IRD while retaining the convenient structure of our target 
function L^. 

We assume that \E'(/3,X) \ Y = k has a known distribution which we denote 
F{rik{f3)), and that F is continuous. We can now use Bayes's theorem to repre- 
sent R{f3, t) as: 

/?(/3, r) = /?(vI/(/3, X), r) = f 1 + 1^ (20) 
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where: 



(a ^^ _ Pj'^iP^X) € (r,,i,r,] \Y = l) _ F,,(^)(r.) - F,,(^)(r.-i) 

^ P(vI/(/3,X)G(r._i,x.] |r = 0) F,„(,)(r,)-i^,„(,)(r,_i) ^ ^ 

Therefore by estimating the parameters p, riQ{f3), rii{f3) we can calculate the estima- 
tors u{/3,t) and R{/3,t) and use them to find the unique f(/3) which solves: 

IRD,(/3) = IRD,(/3, f(/3)) = (22) 

The complete likelihood function under these assumptions is: 

JV 

L{r^om,vm,P I X,y,/3) = J] X,,))P(r, = (23) 

i=i 

where friy.ip) is the density function of "^{X) | F = ?/j ~ F{riy^{f3)). Since the 
estimation of p, ?7o(/3), '7i(/3) is performed for each (3 independently and only in 
order to verify the compliance with the constraint IRDr(/3) = 0, the estimation of 
these parameters does not effect the value of the target function Lii,(/3 | X, y). We 
use this property together with the separability of the likelihood function to estimate 
p and r]o{(3) , r]i{(3) separately for each beta. For the case of random sampling, the 
parameter p is relatively easy to estimate independently of f3 as: 

p = P{Y = l) = ^^ (24) 

although for other cases, such control studies, we might need additional 

information to correct for biased sampling. For the estimation of '7o(/3), '7i(/3) we 
use our assumptions about F to build an ancillary optimization problem for each /3 
individually, and use the estimates provided by the ancillary problem (conditional 
on the value of /3) to estimate IRD.r(/3,r) (again the target function L,j, remain 
unchanged). The likelihood function for the ancillary problem can be rewritten as: 

L^(r^o(/3),r]i(/3)|X,y,/3)= J] ^(^(^(A X,,)) J] /r,i(/3)(^(/3,X,,)) (25) 

{j-yj=o} {j-vj=^} 

where /,;j.(/3) is the density function of the distribution \E'(/3,X) | F = A; ~ F{rik{f3)), 
and the ancillary maximum likelihood estimation problem is: 

(r)o(/3),r)i(/3)) = argmax L^(r/o(/3), r/i(/3) | X,y,/3) (26) 



The construction of multiple ancillary ML estimation problems can be computation- 
ally demanding, but luckily for many known distributions the formula for the ML 
estimator ?)fc(/3) is known, and in fact it is relatively simple to calculate it directly 
from the known ML estimators of the conditional distribution X \ Y = k J^{Ok) 
as ?7fc(/3) = ili.Pi^k)- This fact significantly reduces the complexity of estimating 
the IRD constraint. For a concrete example of such a case using Gaussian logistic 
regression (which can be easily extended to other GLM instances) see section [51 
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Alternatively it would be possible to use non-parametric estimators or approxima- 
tion. While it is possible to attempt to directly approximate Qi{P) = P{Y = 1 | 
< r,(/3)), P(^(/3,X) < r,_i(/3)), P(^(/3,X) < r,(/3)) and utilize © for 
a direct estimation of R{f3) (in this case the equality is useful since (3 is fixed), it 
would often be more convenient to approximate p and Pr?fe(/3) at {Ti{l3)}J~^ (a total 
of 2T — 1 approximations per (3) and use (l20l) to calculate z>(/3) and the resulting 
IRD estimate. The primary advantage of this approach is that it requires no ad- 
ditional assumptions about the distribution of (X, Y) and can therefore be easily 
extended to other non-ML estimation methods of /3. On the other hand, by using 
non-parametric methods we pay a price both in the quality of our estimates (we ig- 
nore information about the distribution of X) and in the computational complexity 
of our estimation scheme. 

Finally, regardless of our approach to estimation, we recognize the fact that under 
realistic scenarios we will have to use numeric methods for the calculation of t(/3) 
and for the estimation of the required parameters. We therefore set a low threshold 
e and accept /3 as feasible if our estimated IRD satisfies: 

IRD,(/3) = IRD,(/3, f(/3)) = \\R{/3, f(/3)) - r\\ < e (27) 

making our feasibility set C'.r(e) = {/3 G M^^ : IRD.,.(/3) < e} and the penalized 
Ordinal Risk-Group (ORG) optimization problem: 

PoRG = argmax L^(/3 \X,y) + Pen{f{f3)) (28) 

If for our choice of r we have Pipi ^ Cr{e) and the distances between the set of 
breakpoints f^lim) are non-degenerate, then the global (unconstrained) optimality 
of (5lr ensures that it is also the optimal solution of the constrained ordinal problem. 
It may also serve as the optimal solution for the constrained and penalized ordinal 
problem, but that will depend on the selection of the aversion parameter. On the 
other hand, as we've seen in section W% once we have more than a single breakpoint 
we introduce limits of feasibility into the maximization problem and may discover 
that the solution to (l36i) is no longer feasible (JRDt^Plr) > For such cases we 
must define a new constrained optimization problem and look for a new optimal 
solution. We discuss an example for such a case in the following section. 

There are two issues we leave outside the scope of this paper. First, although the 
consistency of constrained ML estimation has been explored in various contexts 
(for example for mixture models [H]), the consistency of the ML estimators under 
the specific constraint of IRD = requires verification. Similarly, although in our 
description of the problem r is a function of /3 and the parameters of J-" (a result 
of the uniqueness demonstrated in appendix [B]), it remains to be verified whether 
the consistency of the estimator /3 ensures the consistency of f(/3). We expect the 
fact that for many cases f{(5) has no analytical solution to further complicate this 
problem. 

Second, in order to measure our estimation errors we require a method for build- 
ing right-sided confidence intervals for IRD based on the distribution of IRD for a 
given /3. Since i/(/3,r(/3)) is a ratio of CDF differences, the process of deriving the 
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distribution of z>(/3,f(/3)) from the distribution of fjk{P) would require several steps 
of approximation, primarily since the result of numeric estimation (even if the 
distribution of ?7fc(/3) is known). Similar problems apply for non-parametric estima- 
tors, although we can see two possible approached for a solution. The first approach 
would be to use an equivalent definition of IRD as IRDr(\E', r) = maxj |-Rj(\E', r) — rj| 
and try to prove a Glivenko-Cantelli p8] type theorem for conditional distributions 
which would describe the necessary conditions ensuring that: 

sup \Pn{Y = 1 \ X e {xi,X2])- PiY = 1\ X e {xi,X2])\ — ^0 (29) 



X2>Xl 



N- 



where P/v is the empirical conditional probability estimator: 

\{j : ^(Xj) e {xi,X2\}\ 

Building on this result we can attempt to derive the asymptotic distribution of fl2Ul) 
and try to construct test that will be the conditional equivalent of the Kolmogorov- 
Smirnov test [28]. The second approach, which is less elegant but more plausible, 
would be to combine (12T]) with the well known asymptotic behaviour of the empirical 
conditional distribution function: 

where Nj^ = |{j • Vj = ^}|- By the central limit theorem [2H], this estimator weakly 
converges to -Fr?fe(/3) pointwise: 

(F(f'=)(t)-F,,(t)) iV(0,F,,(t)(l-F,,(t))) (32) 

However the fact that f{f3) chang function of the sample (and is not a fixed 

quantile t) means that the points where Pj^^'"^ is estimated change as a function of the 
sample {Nk), therefore the nature of the convergence and the resulting asymptotic 
distribution depend on the convergence of t(/3) — ?■ t(/3). We would therefore need 
to find sufficient conditions for: 

r(/3)(^^-) ^ r(/3) 

te^HT(/3)(^'^))-F,,(r(/3)(^'^-))) iV(0,F,,(r(/3))(l-F,,(r(/3)))) 

(33) 

The next step would be to use the strong uniform convergence of -Fjjf^*'' — )■ Fr,,^ (as 
provided by the Glivenko-Cantelli theorem) to approximate F^^^\fi{f3)) as normally 
distributed = Fi^''\n{P)) and = Fi^''\n{P)) (^1 - F4'^*^(f,(/3))) . Combined 
with Donsker's theorem [6], we should be able to estimate the asymptotic distribu- 
tion of the difference of two points of f!^^'''^ (which make both the numerator and the 
denominator of liui) as normallt distributed with mean /i = F^^('rj(/3)) — F^^(fj_i(/3)) 

and variance = (F^,(fi(/3)) - F^, (f,_i (/?))) (l - F^,(f,(/3)) - F,, (f,_i (/?))) . The 
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final step would be to use work by Hinkley [12], which describes the distribution of 
a ratio of two non-correlated normal random variables, to approximate the distri- 
bution of C'i{l3,f{f3)) which we can use to estimate P(IRDr{(3) > e). We note that 
the construction of such test would also mean that we can change the definition of 
our feasibility set to Crie, a) = {(3 e M^^ : P(IRD^(/3) <e)>l-a}. We leave the 
details and proof of these ideas for future papers. 

5 Case study: Gaussian Logistic Regression 

Logistic regression is one of the most studied classification methods in the scientific 
literature and has been widely applied in statistics, scientific research and industry. 
The name "logistic" for the function f{x) = was originally coined by Verhulst 
as early as the 19 'th century, but it was Cox [4j who used it first in the context 
of binary data analysis. The concept of multinomial logistic regression was first 
suggested by Cox (1966) [3J and developed independently by Theil (1969) [26]. The 
link to ordered choice models (ordered logistic regression) was made by McFadden 
in his paper from 1973 [19]. Cramer (2002) [5] has a complete historical review. 

Logistic regression belongs to the group of classification methods that estimate 
class membership probability rather than predict class membership. It is a spe- 
cial instance of a larger group of parametric models called generalized linear models 
(GLM [20j), which extends linear models by allowing the addition of a predefined 
link function g that connects the linear model (/3 G M^) to the response vari- 
able Y (meaning that g^^{Y^ = P'^X) and assuming that the distribution of X is 
from an exponential family. In the case of logistic regression the link function is 

assumed to be the logistic function g{l3'^x) = ^^Jt^ ; making the inverse function 

g~^{p) = logit(p) = log (^i^^ The probability of belonging to the "special class" 
(in the case of 2 classes) conditioned on the r.v X is assumed to be: 

= 1 1 ^) = ^^^^ 

where the assumptions on (X, Y) can be modified to match a wide variety of cases, 
for example to a non-random sampling scheme like case-control studies or semi- 
parametric models [23] . 

For the purpose of demonstrating our method we assume that X is a X x P ma- 
trix representing N i.i.d random samples from a P-dimensional multivariate normal 
distribution. Under this assumption the vector y G {0,1}^ of class memberships 
represents the result of N independent Bernoulli random variables {Yj}J^^. Even 
under these assumptions, the ML problem does not have an analytical least squares 
solution, and is usually solved using numerical maximum likelihood algorithms. The 
log-likelihood function is: 

N N 

IlrW I X, y) = log{LLR{P I X, y)) = J2 " E + (^5) 

j=l 3=1 
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and the logistic regression (LR) maximum likelihood optimization problem is: 



Plr = argmax | X, y) (36) 



As we have noted in section 14.21 the parametric estimation of IRD requires several 
additional assumptions. We assume that the conditional distributions X \ Y = k 
{k = 0, 1) are also multi-variate normal, and in order to achieve SMLRP we assume 
equal conditional covariance. Using terms defined to construct fl28|) our assumptions 
on Gaussian logistic regression translate into the following: 

^^^'^^= l7Slr A-|r = .~M,'iV(,,E) (37) 

As a result: 

logit(vl>(/3,X)) I r = A; ~ N{fik{f3) = fi,, cr\f3) = f3^m (38) 

Conveniently, the ML estimators follow a similar pattern. For the construction of 
the known ML estimators of the distribution of X we denote X'^'^^ as the matrix 
composed of all the lines of X for which yj = k, Ni = J2f=i Vj^ = ^ ~ ^j=i 

and X^ = — ^'"^ as the average of the m'th column of X*^ (m = 1, . . . , P). 
The ML estimators are: 

A, = X^'^ = (xf\...,X;.'^) (A; = 0,1) (39) 

and having assumed equal covariance we use the pooled covariance matrix estimator: 

^S=i ^ ' (40) 

The assumption of multivariate-normal conditional distributions enables us to avoid 
the construction of an ancillary ML problem for each /3 by using the known relation- 
ship between the ML estimators {fi^, fl_^,Ti) and the /3-transformed ML estimators 

tim = P'^lk. <3-(/3) = y^^ (41) 

The strict monotonicity of logit{p) = log (^iz^^ = ^^^(p) P means that we can 
use the equality: 

Pi^{f3,x) e ir,^m,nm\y = k) 

= Pi(3^X e (logit(r,_i(/3)),logit(r,(/3))] \Y = k) 

^ logit(r,(/3))-/Xfc(/3) \ _^ /logit(r,_i(/3)) -/i,(/3)^ 



a(/3) ; V 
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to estimate z/j(/3) as: 



= , J , ... ^!!L,„,( (43) 



/ logit(f,(/3))-Ao(/3) \ _ ^ / logit(f,_i(/3))-Ao(/3) 



Finally, we use the assumption of random sampling to estimate P = jj J2f=i Vj 
utilize fl2Ul) to construct our parametric estimation of IRD for logistic regression. 



5.1 Example: The Wisconsin Diagnostic Breast Cancer 
(WDBC) Dataset 

In this section we bring an example of the sub-optimality of using one of the most 
commonly used classification methods - Logistic Regression (LR) - to solve a rela- 
tively simple ordinal problem. We then provide the Ordinal Risk-Group version of 
Logistic Regression (ORG-LR) solution to the problem and compare our results. 

The dataset we used for this example is the extensively used Wisconsin Diagnostic 
Breast Cancer dataset from the UCI Machine Learning Repository [H], which 
contains the analysis of cell nuclei from 556 patients using digitized images of fine 
needle aspirate (FNA) of extracted breast masses. Since we assumed a continuous 
J-" and equal variance we selected the following features for the construction of our 
models: texture, log area, smoothness, log compactness, log concave points, log 
symmetry and an intercept variable. The final result from the diagnosis ("Malig- 
nant" = 212/ "Non-Malignant" = 344) was used as the dependent variable 
for the logistic regression analysis and ordinal risk group analysis. The code for 
this example was written in the R programming language [23] using internal opti- 
mization algorithms and the Augmented Lagrangian Adaptive Barrier Minimization 
Algorithm (the alabama library) with the constraint IRDr(/3) < e = le-07. 

The first set of risk levels we tested was ri = (10%, 50%, 90%). The estimated 
IRD for the logistic regression solution /3i/j and the matching (non-degenerate) set 
of breakpoints r(/3i/j) = (—2.6918,9.1698) was slightly above our set feasibility 
threshold (IRD.,.^(/3lr) = 7.8776e-05). The second set of risk levels we tested was 
r2 = (20%, 50%, 80%). For this set the solution /^^.r provided by the logistic regres- 
sion was clearly infeasible: on the one hand the interval associated to the 50% risk 
level was clearly degenerate (t2(/3l_r)— Ti(/3j^ij) < l.le-06) and the estimated IRD was 
high above our set threshold (IRDr2(/3L_R) = 0.0014 > e). We proceeded to construct 
a constrained maximum likelihood problem for both ri,r2 as described in section 
[5l For ri an unconstrained problem was sufficient, with IRD < le-07 and a non- 
degenerate t(/3). For r2, the unpenalized ordinal risk-group problem produced de- 
generate solutions, so we added the penalty function Pen(/3, r) = ( "/^r -/^q)^ ~ 1)^5 
which was designed to balance between the distance between the breakpoints ri, T2 
and the distance between estimated class means, and selected a penalty coefficient 
7 = 10. Since in many cases the optimization algorithm converged to a local min- 
imum we randomly sampled 25,000 starting points for each set of risk categories 
we tested. The estimates for logistic regression (LR) and ordinal risk-group logistic 
regression (ORG-LR) for both ri,r2 are summarized in table [TJ 
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ri 


r2 




T T ) 


(JKG-LK 


LK 


UKG-LK 




-87.6641 


12.3928 


-87.6641 


-3.1977 


Texture 


U./864 


0.1894 


0.2864 


0.1844 


\r\(T( A Tf^p 1 


ii.y /U6 


O.o999 


11.9 ^ Ob 


-O.OO (O 


SI m o of n Tl P 


67.7342 


-25.2575 


67.7342 


4.0826 




-i.oiU i 


Q none; 


-i.oiU / 


U.oU ( U 


lofff donprivp Points) 


3 66Q8 


1 1 8390 


3 6698 


1 169 


loff(Symmetry) 


3.2556 


0.7888 


3.2556 


-0.3529 


log-Likelihood 


-45.9909 


-94.8944 


-45.9909 


-311.019 




-2.6918 


-4.4029 


3.117868 


-0.7088 




9.1698 


6.8524 


3.117869 


1.3376 


P(F = l|^(/3,X)<ri) 


9.2925% 


9.9751% 


16.7680% 


19.9857% 


P(F = l|vl>(/3,X)G(ri,r2]) 


50.3289% 


50.0130% 


51.5483% 


50.0116% 


P(F = l|^(/3,X)>r2) 


89.5768% 


89.9870% 


78.7930% 


79.9948% 


IRD 


7.88e-05 


9.69e-08 


0.0014 


3.66e-08 



Table 1: Maximum likelihood (ML) estimators of coefficients, log- likelihood, optimal 
breakpoints r, model predicted probabilities (assuming multivariate normal distribu- 
tion) and IRD estimates for unconstrained logistic regression and ordinal risk-group 
logistic regression (ORG-LR) for ri = (10%, 50%, 90%), ra = (20%, 50%, 80%) 

The differences between the two methods can be further illustrated by looking at 
the distributions of the logit-transformed predicted probabilities {logit{Yi)}f^^ of 
both methods. Figure [2] illustrates the logistic regression solution for ri (top graph) 
and the ordinal risk-group logistic regression solution for ri (bottom graph), and 
figure [3] illustrate the same results for r2 . A comparison of the two graphs in each 
figure shows that for both sets of risk levels, the ORG-LR solution compromises 
the quality of separation between the two classes in order to achieve feasibility, and 
in the more extreme case of r2 reduces separation dramatically in order to avoid 
degenerate solutions. 

In order to validate the results of our new method and compare them to the perfor- 
mance of the logistic regression solution we performed a cross validation study. We 
randomly divided the dataset into two groups: 90% of the patients were randomly 
sampled as a training set, from which a logistic regression model and an ordinal 
risk-group model were constructed, and the remaining 10% were used as a test set 
for the models. We repeated the process with 25,000 random samples and calculated 
the percentage of " Melignant" cases found in each predicted risk group for each of 
the models. The results of the implementation of the training models on the test 
sets for ri = (10%, 50%, 90%) were (0.7173%, 74.6978%, 100%) for logistic regression 
(IRD = 0.07962) and (3.8804%, 57.4549%, 99.9329%) for ordinal risk-group logistic 
regression (IRD = 0.01917). The cross validation results for r2 = (20%, 50%, 80%) 
using ORG-LR were (15.7349%, 52.9507%, 84.9715%) (IRD = 0.0052). Since the 
logistic regression solution was degenerate for r2 we were unable test it with cross 
validation. 
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Figure 2: Logit-transformed predictions and separation between the malignant and 
non-malignant classes of logistic gegression (LR) (top graph) and the Ordinal Risk- 
Group Logistic Regression (ORG-LR) (bottom graph) for ri. The black dotted lines 
mark the matching sets of breakpoints t(/3lj?) and t{Porg-lr)- 



The comparison of the cross validation results from the two methods for ri shows 
that although both models did not perform wery well, the ORG-LR solution out- 
performed the logistic regression solution (IRD is approx. 4 times smaller), in spite 
of the fact that differences in IRD between the two models do not seem significant. 
We estimate that one of the reasons for the of the high absolute deviance in IRD 
of all models and risk levels we tested is that the data is not exactly normally dis- 
tributed (as evident in figures l2|3|l . In addition, specifically for ORG-LR, we suspect 
that the generic algorithms we used for the ordinal risk-group maximum likelihood 
constrained optimization failed to converge to the real global minimum in some of 
the cross-validation iterations. Verifying this hypothesis would require either use of 
more specific optimization algorithms (for example by analytically calculating the 
derivatives of the IRD constraint) or by a very large scale simulation study that 
would require billions of iterations. Both approaches are outside the scope of this 
paper and we leave them for future studies. 
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Logistic Regression (LR) 




Figure 3: Logit-transformed predictions and separation between the malignant and 
non-malignant classes of logistic gegression (LR) (top graph) and the Ordinal Risk- 
Group Logistic Regression (ORG-LR) (bottom graph) for r2- The black dotted lines 
mark the matching sets of breakpoints t(/3lj?) and t{Porg-lr)- 



6 Conclusions 



The exact estimation of the conditional risk function is an important part of practical 
and theoretical research. However the practical application of this information is 
very often in the form of a finite and small set of resulting actions. Although 
conditional risk quantiles provide valuable information, we ultimately want to know 
the risk associated with adjacent non-overlapping intervals in order to create distinct 
ordinal risk groups. As we have demonstrated in section 13. quantile regression is 
not useful for that purpose. Furthermore, section [22] demonstrates that the practice 
of dividing post-hoc the continuous estimate of conditional risk into intervals ignores 
the limitations introduced by the lower bounds on IRD and may produce sub-optimal 
or degenerate solutions. 

Our formulation of the optimization problem, as presented in section IH reflects 
our understanding that while the model's ability to separate the classes remains 
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the key issue, we must introduce both a new constraint and a penalty function 
in order to achieve two additional objectives: an accurate risk distribution and a 
usable partition scheme. While IRD represents an absolute measure of the model's 
quality and must be a constraint on the optimal solution, the "softer" requirement 
on minimal interval length should allow for flexibility in application. We believe 
that a penalty function enables better control and adaptation through the choice of 
function and the aversion parameter. 

Finally, we wish to emphasize the implications of the most counter intuitive result 
of this paper - the existence of limitations on certain risk structures (the vector r) 
in the form of lower bounds on the error rate IRD (equation [2]) . Although most of 
the examples we described are linear or logistic models with Gaussian conditional 
distributions, the existence of lower bounds holds for any continuous risk estimator. 
A re-evaluation of the optimal properties of such estimators in the context of risk 
discretization is therefore required. We leave the specifics of applying these ideas to 
other classification methods as well as proofs of consistency and the construction of 
confidence intervals to future studies. 
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A Appendix: Equivalence of strict monotonic- 
ity of CDF ratios over intervals and the Strict 
Monotone Likelihood Ratio Property 



Let X be a continuous P-dimensional random vector and Y E {0,1} a Bernoulli 
random variable representing class membership. Assume that the conditional dis- 
tributions X \ Y = k [k = 0,1) are also continuous and the conditional densities 
fx\Y=k are finite. Let \1/ : — )■ M be a finite continuous risk predictor. For a 
single observation X, the likelihood ratio of the risk predictor \E' between the two 
alternatives represented by Y is defined as the ratio of the conditional densities: 

P{^iX)=x\Y = l) h(X)\Y=l{x) 

"^"^^ P{^{X) = x\Y = 0) hix)\Y=o{x) 

The Strictly Monotone Likelihood Ratio Property (SMLRP) demands that for a given 
X, Y, \1/ the ratio A,j, (x) is a strictly monotone function of x. It is worth noting 
that by using Bayes theorem we can show that demanding SMLRP is equivalent to 
demanding strict monotonicity of P(Y = 1 \ "^{X) = x) in x: 

P{Y = 1 I = x) = ^{^/''^....^ = . . 1 (44) 



p/i(x) + (l-p)/o(x) 1 



P A*(a;) 



where p = P(Y = 1) and fk{x) = f^i{x)\Y=k{x) are the density functions of the condi- 
tional distributions. This equivalence means that in terms of conditional probability, 
SMLRP is equivalent to strict pointwise monotonicity in the condition, in contrast to 
the requirement of monotonicity over right-expanding intervals in section 13. 2^ which 
we defined as the strict monotonicity of -R(/3,r)j = P{Y = 1 \ "^{X) G (Tj_i,rj]) in 
Ti for any ri_i while > rj_i. 

Theorem A.l. SMLRP Vri_i, Vrj > rj_i R{P,T)i is strictly increasing in Xj. 

Proof. Using our previous definition of R{/3, r)i = P{Y = 1 \ "^{X) G (tj^i, Xj]) and 
Bayes theorem we can represent: 



p{F,{n) - Fi(r._i)) + (1 - p){Fo{n) - Fo(r._0) 1 + '-f^^^^^ 

(45) 

where 

, , fi(Ti)-Fi(ri,i) ,^„. 

-'*(^-"^-' = F„(r.)-F,(n_,) (^«' 

and Fk{x) = -FV(x)|y=fc(a;) are the cumulative distribution functions (CDF) of the 
conditional distributions. The strict monotonicity of -Rj(\l/, r) in for any rj_i < Ti 
is therefore equivalent to the strict monotonicity of 7(c, x) in x for any c, x > c. 

In addition, for two positive, finite, strictly increasing and once differentiable func- 
tions g, h the following equivalence holds: 

q(x) , q'(x)h(x) — q(x)h'(x) q(x) q'ix) , 

ff4 is strictly increasing ^ L, > ^ ff4 < f4-4 47 

h[x) h^[x) h[x) h'[x) 
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Since Fq, Fi meet these requirements, then by fH5l) the strict monotonicity of both 
7^,(c, x) and i?(\l/,r)j is equivalent to following condition: 

• ' F„(i)-F„(c) _ /„(i) ' ' ' 

It is therefore sufficient to show that under the above assumptions of continuity and 
finiteness that the following equivalence holds: 

SMLRP ^ 7vi>(c, x) < A^(x) Vc, c < a; (49) 

Step 1: SMLRP -f^{c,x) < A^{x) Vc < x 
Under SMLRP: 

Vxi > xo > ^ fi{x,)fo{xo) > h{xo)fo{x,) (50) 

The equivalence holds since /o, /i are strictly positive, continuous and finite. Inte- 
grating on Xq over the interval [c, Xi] we have: 

fiixi)foixo)dxo > / fiixo)foixi)dxo , . 

Jc (51) 

^ /i(xi)(Fo(xi) - Fo(c)) > /o(xi)(Fi(xi) - Fi(c)) ^ 7*(c,a;i) < A^(xi) 

and this holds for any Xi G M. In addition setting c = — oo when integrat- 
ing maintains the strict inequalities of f l5ip . and therefore SMLRP also ensures 
strict monotonicity of 7*(— cxd,x) = Fi{x)/Fo{x) and the equivalent monotonicity 
of P{Y = 1 I ^(X) < x) in X. 

Step 2: \fc,c< X 'y^{c,x) < A^{x) SMLRP 
Under our assumptions: 

w ^ ^ ( \ Ic' f^(^)dx h{x)dx 

VXI > Xq> C 7(C, Xl) = r I ^, > rxo n , x , = 7(c, Xq) (52) 

Assuming all functions are continuous and finite we can take c — )■ Xq: 



L /o(a;)c?a; /o(a;o) 



This inequality is strict since for any Xi > Xq there exists e = 2J-^ > such that: 

Qh{x)dx j^l^' h{x)dx f,(xo) 
7(xo,xi)= > + = 7(3^0,3^0 + g) > -77-T (54) 

On the other hand taking c — t- xi (using the same considerations and utilizing the 
fact that for b > a, f{x)dx = — J^^ f{x)dx) and combining with fl55|l we have: 

/i(xi) ^ C/i(^)^^ . . ^ /i(xo) 
/ol^^i) /o(a;)rfx /o(xo) 

□ 
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B Appendix: Uniqueness of r under the Strict 
Monotone Likelihood Ratio Property 



The risk estimation methods mentioned in this paper typically deal only with the 
optimal estimation of (and the breakpoints r are defined post-hoc). Introducing 
the set of breakpoints r as an integral part of the definition of IRD increases in 
the number of parameters that must be estimated simultaneously, resulting in a 
more complicated parameter space (for example we require rj_i < Tj). Although the 
increase in the number of estimated parameters should not be significant (in practical 
scenarios we expect T < 10) the result nonetheless would be longer running times 
for the optimization algorithms. Before we proceed any further it would be useful 
to identify sufficient conditions for the uniqueness of r for a given \E': 

Lemma B.l. If for a given \E' the likelihood ratio A,^(x) = satisfies the 

strict monotone likelihood ratio property (SMLRP), then if there exists such that 
IRDri^^, r<ir) = it is unique. 

Proof. If A^(x) satisfies SMLRP, then by theorem I A. II (appendix R|) R is strictly 
monotone is r^. The rest is by induction: strict monotonicity of i?i means that if 
there exists ti which satisfies i?(\E',r)i = ri, then it is unique. Fixing rj_i, if ffTTl) 
holds (meaning that Ti is "feasible"), then again by strict monotonicity, if there 
exists Ti that satisfies Ri^^, T)i = rj, then it is unique. 

Therefore if (fTTj) holds for all i then only a single r satisfies IRDr(\I', r) = 0. □ 

Corollary B.2. Under SMLRP we can denote r = r(\I') and define IRD using \E' 
alone: 

Ri{^^f) = P{Y = 1\ ^(X) G (r,_i(^),r,(^)]), IRDr{^) = ||i?(^) -r|| (56) 



26 



